This practical lab concerns integrating R with the web, both as a source of data and as an analytics platform. These connections use Application Programming Interfaces (APIs), which enable data queries or analytics to be run and the results returned within R. Much of the complexity of these interfaces is hidden by the R packages demonstrated here, and they are therefore quite accessible.
One of the simplest ways in which you can read data from the web is by using some of the same functionality used for reading local files. For example, we can read a CSV file of municipal swimming pool locations in Brisbane, Australia as follows:
# The local file location is swapped for a remote url
swimming_pools <- read.csv("https://www.data.brisbane.qld.gov.au/data/dataset/ccf67d3e-cfaf-4d30-8b78-a794c783af9f/resource/c09546c8-9526-4358-a1eb-81dbb224cdca/download/Pools-location-and-information-09Dec16.csv")
#Show the top six rows
head(swimming_pools)
## Name Address
## 1 Acacia Ridge Leisure Centre 1391 Beaudesert Road, Acacia Ridge
## 2 Bellbowrie Pool 47 Birkin Road, Bellbowrie
## 3 Carole Park Swim Centre Cnr Boundary Road and Waterford Road Wacol
## 4 Centenary Pool (Spring Hill) 400 Gregory Terrace, Spring Hill
## 5 Chermside Pool 375 Hamilton Road, Chermside
## 6 Colmslie Pool (Morningside) 400 Lytton Road, Morningside
## Phone_No Phone_No_2
## 1 3277 8686
## 2 3202 6620
## 3 1300 332 583 3271 6116
## 4 1300 332 583
## 5 1300 252 583
## 6 1300 733 053
## Opening_Hours
## 1 SUMMER HOURS (from 17 September 2016)\nMonday to Thursday: 6am-7pm\nFriday: 6am-6pm\nSaturday: 8am-5pm\nSunday: 9am-5pm\nPublic holidays: 12pm-5pm. Closed Christmas Day, Good Friday and Anzac Day.\n\nOutdoor Lagoon Pool\nSaturday and Sunday: 11am-5pm\nSchool Holidays: 11am-5pm daily
## 2 Opening Early for the Summer Season from Monday 5th September\nMonday \x96 Thursday 8am \x96 11.30am\n3.30pm \x96 5.30pm\nFriday \x96 8am \x96 12pm\nSaturday & Sunday 8am \x96 6pm
## 3 SUMMER HOURS (17 September 2016 to 29 March 2017)\nMonday to Friday: 5.30am-7.30am and 11am-6pm\nSaturday: 10am-4pm\nSunday: 10am-4pm\nPublic holidays: 11am-4pm\nClosed Christmas Day and Good Friday\n\nWINTER HOURS\nClosed 30 March to 16 September 2016. Re-opens 17 September 2016.
## 4 NORMAL HOURS \nPool and health club hours (excluding dive pool) \nMonday - Thursday: 5am-8pm\nFriday: 5am-6pm\nSaturday to Sunday: 7am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day and Good Friday.\n\nDive pool (opening hours for public use)\nMonday-Friday: closed\nSaturday: 1-3pm\nSunday and school holidays: 11.30am-1.30pm\nPublic holidays: closed
## 5 SUMMER HOURS (from 17 September 2016)\nMonday to Thursday: 5am-8pm\nFriday: 5am-7pm\nSaturday and Sunday: 7am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day, Good Friday and Anzac Day.
## 6 This pool is open all year round, except for the Kids Fun Pool pool which is only open during summer (September - April).\n\nVENUE HOURS\nMonday - Thursday: 5:30am-8pm\nFriday: 5:30am-6pm\nSaturday: 7.30am-6pm\nSunday: 8am-6pm\nPublic holidays: 9am-5pm. Closed Christmas Day, Good Friday and Anzac Day.
## Facilities
## 1 Aqua aerobics, Disabled Access/Facilities, Enclosed Program pool, Indoor heated Pool, Outdoor pool, Lifeguards, Open in winter, Squad Swimming, Swimming lessons, Water play
## 2 Disabled Access/Facilities, Heated pool, Indoor Pool, Lifeguards, Outdoor pool, Squad Swimming, Stroke Development, Swimming lessons, Wading pool, Water play, Caf\xe9
## 3 Aqua-aerobics, Heated pool, Kiosk, Lifeguards, Outdoor pool, Swimming lessons, Wading pool
## 4 Aqua aerobics, Diving, Gym, Heated pool, Kiosk, Open in winter, Outdoor pool, Squad Swimming, Stroke Development, Swim Fit, Swimming lessons, Wading pool, caf\xe9, water polo
## 5 Aqua aerobics, Disabled Access/Facilities, Heated pool, Indoor Pool, Kiosk, Leisure Centre/Water Park, Lifeguards, Open in winter, Outdoor pool, Squad Swimming, Stroke Development, Swimming lessons, Water play
## 6 Aqua aerobics, Disabled Access/Facilities, Heated pool, Indoor Pool, Kiosk, Lifeguards, Open in winter, Outdoor pool, Swimming lessons, Wading pool, Water play
## Disability_Access Parking
## 1 Yes Free Car parking: 120 spaces; Public transport: Bus
## 2 Yes Free Car Parking ; Public transport: Buses
## 3 No Free Car parking; Public transport: Bus & train
## 4 Yes Car parking available; Public Transport
## 5 Yes Car park available; Public transport: Bus
## 6 Yes Car park available
## Latitude Longitude
## 1 -27.58616 153.0264
## 2 -27.56547 152.8911
## 3 -27.60744 152.9315
## 4 -27.45537 153.0251
## 5 -27.38583 153.0351
## 6 -27.45516 153.0789
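Reading directly from a URL requires a live connection every time the script runs. One defensive option, sketched below, is to cache a local copy with the base R function download.file() and read from that; the file name "pools.csv" is an arbitrary choice for illustration.

```r
# A sketch of caching a remote CSV locally before reading it
url <- "https://www.data.brisbane.qld.gov.au/data/dataset/ccf67d3e-cfaf-4d30-8b78-a794c783af9f/resource/c09546c8-9526-4358-a1eb-81dbb224cdca/download/Pools-location-and-information-09Dec16.csv"
local_copy <- "pools.csv"  # hypothetical file name
# Only download if we do not already have a copy on disk
if (!file.exists(local_copy)) {
  download.file(url, destfile = local_copy)
}
swimming_pools <- read.csv(local_copy)
```

This keeps later runs working even if the remote server is temporarily unavailable.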
Reading special file formats such as JSON requires additional packages such as jsonlite. In this section, we will use this package to retrieve a JSON file from a Web API. First install and load jsonlite:
#Install jsonlite
install.packages("jsonlite")
#Load Package
library(jsonlite)
Generally a Web API is a service that receives requests or queries from users and returns a result via a web protocol (mainly HTTP). In this way, users can ask for and use data even without knowing how data are stored and processed. Due to the popularity of JavaScript in the WWW, JSON has become the most popular file format served by Web APIs.
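Before calling a live API, it can help to see how JSON maps onto R objects. fromJSON() also accepts a JSON string directly; the record below is invented purely for illustration.

```r
library(jsonlite)

# A small invented JSON record: an object with a string, a logical
# and an array, which jsonlite converts to a named list in R
pool_json <- '{"name": "Chermside Pool", "heated": true, "lanes": [25, 50]}'
pool <- fromJSON(pool_json)

pool$name   # "Chermside Pool"
pool$lanes  # a vector of length 2
```

JSON objects become named lists, and JSON arrays of a single type are simplified to R vectors.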
In the following example we pull live station data from the San Francisco bike share scheme:
bikes <- fromJSON(txt = "http://feeds.bayareabikeshare.com/stations/stations.json")
The bikes object is a list; the first entry returns the query time:
bikes[1]
## $executionTime
## [1] "2016-12-24 01:11:12 PM"
The second element contains the data, which we will use to create a new data frame object “bikes_SF”:
bikes_SF <- data.frame(bikes[2])
head(bikes_SF)
## stationBeanList.id stationBeanList.stationName
## 1 2 San Jose Diridon Caltrain Station
## 2 3 San Jose Civic Center
## 3 4 Santa Clara at Almaden
## 4 5 Adobe on Almaden
## 5 6 San Pedro Square
## 6 7 Paseo de San Antonio
## stationBeanList.availableDocks stationBeanList.totalDocks
## 1 15 27
## 2 5 15
## 3 9 11
## 4 16 19
## 5 10 15
## 6 5 15
## stationBeanList.latitude stationBeanList.longitude
## 1 37.32973 -121.9018
## 2 37.33070 -121.8890
## 3 37.33399 -121.8949
## 4 37.33141 -121.8932
## 5 37.33672 -121.8941
## 6 37.33380 -121.8869
## stationBeanList.statusValue stationBeanList.statusKey
## 1 In Service 1
## 2 In Service 1
## 3 In Service 1
## 4 In Service 1
## 5 In Service 1
## 6 In Service 1
## stationBeanList.status stationBeanList.availableBikes
## 1 IN_SERVICE 12
## 2 IN_SERVICE 10
## 3 IN_SERVICE 2
## 4 IN_SERVICE 3
## 5 IN_SERVICE 4
## 6 IN_SERVICE 10
## stationBeanList.stAddress1 stationBeanList.stAddress2
## 1 San Jose Diridon Caltrain Station
## 2 San Jose Civic Center
## 3 Santa Clara at Almaden
## 4 Adobe on Almaden
## 5 San Pedro Square
## 6 Paseo de San Antonio
## stationBeanList.city stationBeanList.postalCode stationBeanList.location
## 1 San Jose Crandall Street
## 2 San Jose W San Carlos Street
## 3 San Jose W Santa Clara Street
## 4 San Jose Almaden Boulevard
## 5 San Jose N San Pedro Street
## 6 San Jose Paseo de San Antonio
## stationBeanList.altitude stationBeanList.testStation
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## stationBeanList.lastCommunicationTime stationBeanList.landMark
## 1 2016-12-24 13:10:44 San Jose
## 2 2016-12-24 13:10:08 San Jose
## 3 2016-12-24 13:07:11 San Jose
## 4 2016-12-24 13:09:14 San Jose
## 5 2016-12-24 13:10:48 San Jose
## 6 2016-12-24 13:08:07 San Jose
## stationBeanList.is_renting
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
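Once the JSON has been converted to a data frame, its columns can be analysed like any other. As a sketch, assuming the bikes_SF object created above, we might compute the proportion of docks at each station that currently hold a bike; the column names are taken from the JSON field names, prefixed by "stationBeanList.".

```r
# Proportion of docks at each station that currently hold a bike
bikes_SF$occupancy <- bikes_SF$stationBeanList.availableBikes /
  bikes_SF$stationBeanList.totalDocks

# Show the fullest stations first
head(bikes_SF[order(-bikes_SF$occupancy),
              c("stationBeanList.stationName", "occupancy")])
```

Because this is a live feed, the values will differ each time the query is run.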
Although we covered some aspects of using web-enabled infrastructure to conduct remote queries previously (see 2. Data Manipulation in R), there is an array of ways in which different services can be used from within R. Here we will explore the ggmap package, which extends the mapping capabilities of ggplot2.
In the previous section we created a list of Twitter accounts based on followers of the City of Boulder, CO Twitter account, limited to those with user-specified locations. If you view these details, it is obvious that they are of variable quality (in terms of actually being places); however, a substantial proportion do relate to geographic locations. First install and load ggmap:
install.packages("ggmap")
library(ggmap)
## Loading required package: ggplot2
We will now write some code that will attempt to geocode the locations. First we will extract a list of locations and their frequency:
# Frequency table of locations
Locations <- data.frame(table(followers_BCO_Details_GEO$location))
# Sort in descending order
Locations <- Locations[order(-Locations$Freq),]
The distribution has a very long tail, with many locations appearing only once. The Google geocoding API has a call limit of 2,500, so we will first select all those locations with a frequency over 1, which results in 1,151 records. We will then add a random sample of 1,349 of the locations that appear only once.
# create a sample of locations with a frequency over 1
A <- Locations[Locations$Freq > 1,]
# create a sample of locations with a frequency of 1
B <- Locations[Locations$Freq == 1,]
#Randomly select rows that when added to A will make the total rows 2500
B <- B[sample(1:nrow(B),(2500 - nrow(A))),]
#Combine the two together and keep just the locations
sample_locations <- as.character(rbind(A,B)[,"Var1"])
#Show the first six locations
head(sample_locations)
## [1] "Boulder, CO" "Denver, CO" "Boulder, Colorado"
## [4] "Colorado" "Colorado, USA" "United States"
Geocoding is managed very simply with the geocode() function, which accepts a character vector of names to search. The geocoding call below has been commented out as it has already been run; the results were saved in the “U_Locations_Geocode.Rdata” file, which we load instead:
#geocode sample
# U_Locations_Geocode <- geocode(sample_locations,output="latlon",source="google")
# save(U_Locations_Geocode, file = "./data/U_Locations_Geocode.Rdata")
# Load the geocoding results
load("./data/U_Locations_Geocode.Rdata")
Next we need to join the coordinates back onto the sample locations - these align because the places in “sample_locations” were geocoded in the order in which they appear in the data frame. As such we can use cbind() to “column bind” the two objects together:
# Column bind the two data frame object
sample_locations_geocoded <- cbind(sample_locations,U_Locations_Geocode)
# Show the first 6 rows
head(sample_locations_geocoded)
## sample_locations lon lat
## 1 Boulder, CO -105.27055 40.01499
## 2 Denver, CO -104.99025 39.73924
## 3 Boulder, Colorado -105.27055 40.01499
## 4 Colorado -105.78207 39.55005
## 5 Colorado, USA -105.78207 39.55005
## 6 United States -95.71289 37.09024
We can then append the geocoded results back onto the Locations object:
# Append the geocoded locations
Locations_GEO <- merge(Locations, sample_locations_geocoded, by.x="Var1",by.y="sample_locations",all.x = TRUE)
# Remove all the records with no locations
Locations_GEO <- Locations_GEO[!is.na(Locations_GEO$lat),]
# Change the column names
colnames(Locations_GEO) <- c("location","frequency","lon","lat")
We can have a look at these on a map, which shows the main cluster of locations around Boulder, which is what you might expect.
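The code used to draw that map is not shown above; a minimal sketch of one way to do it, assuming the Locations_GEO object created above, is to plot the points with ggplot2 and scale each one by its frequency:

```r
library(ggplot2)

# Plot the geocoded locations, sizing each point by how often
# the location string occurred among the followers
ggplot(Locations_GEO, aes(x = lon, y = lat, size = frequency)) +
  geom_point(alpha = 0.5) +
  coord_quickmap()  # keep an approximately correct aspect ratio
```

The alpha transparency helps where many points overlap around Boulder.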
Although we have previously covered various ways in which we can create maps in R, it is often helpful to pull down background maps to help illustrate our cartography. This again relies on APIs; however, these are hidden within the R functions.
We will be using some data from AirBnB concerning the locations of properties that have been identified by their owners as being within Manhattan, NYC. We will first read in these data.
# Read in data
listings <- read.csv("./data/listings.csv")
# Calculate a price per bed
listings$price_beds <- listings$price / listings$beds
#Show top six rows
head(listings)
## id latitude longitude property_type room_type accommodates
## 1 2082223 40.71031 -74.01638 Apartment Private room 1
## 2 2986941 40.71728 -74.01524 Apartment Entire home/apt 2
## 3 1712688 40.71177 -74.01730 Apartment Entire home/apt 2
## 4 845495 40.70743 -74.01732 Apartment Entire home/apt 4
## 5 2373737 40.70773 -74.01754 Apartment Entire home/apt 4
## 6 1777007 40.71078 -74.01623 Apartment Shared room 1
## bathrooms bedrooms beds bed_type price review_scores_rating price_beds
## 1 1 1 1 Real Bed 80 90 80
## 2 1 1 1 Real Bed 300 100 300
## 3 1 0 1 Real Bed 400 NA 400
## 4 1 1 2 Real Bed 250 97 125
## 5 1 1 1 Real Bed 255 93 255
## 6 NA 1 1 Couch 43 94 43
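Dividing price by beds can produce NA (when beds is missing) or Inf (when beds is zero), which would distort the plots below. A defensive sketch, assuming the listings object just created:

```r
# Inspect the derived variable, including any NA values
summary(listings$price_beds)

# Blank out non-finite values (NA beds or division by zero)
# so they are dropped rather than plotted
listings$price_beds[!is.finite(listings$price_beds)] <- NA
```

is.finite() is FALSE for NA, NaN and Inf alike, so one test covers all three cases.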
To plot a base map we use the get_map() function, which requires a number of input parameters, including a location, here given as a longitude, latitude pair for the centre of the map. For this, we will take the mean of the property locations to centre the map. The other parameter required is “zoom”, which sets the scale of the map (low number = globe; high number = close to streets). The “maptype” parameter controls the tileset used for the map.
map <- get_map(c(mean(listings$longitude),mean(listings$latitude)),zoom=13,maptype = "roadmap")
P <- ggmap(map) # Note we have stored the basic map in the new object P
P
Another way in which we can set up a map is by using a keyword rather than a specific lon/lat pair. For example, the following will give you a map of Singapore.
ggmap(get_map("Singapore",zoom=12,maptype = "roadmap"))
As shown in the previous tutorial, we can control elements of the plot within ggplot, and the same is true for ggmap. For example, if we want to hide the axes:
# Add a series of options onto the previously created object P
P + theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.title=element_blank(),
axis.ticks = element_blank(),
legend.key = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
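If the same blank styling is needed on several maps, it can be stored once in an object and reused; a sketch, where "blank_theme" is an arbitrary name:

```r
# Store the blank styling once so it can be added to any map
blank_theme <- theme(axis.line = element_blank(),
                     axis.text = element_blank(),
                     axis.title = element_blank(),
                     axis.ticks = element_blank(),
                     legend.key = element_blank(),
                     panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(),
                     panel.border = element_blank(),
                     panel.background = element_blank())

# Reuse on the base map created earlier
P + blank_theme
```

This avoids repeating nine theme options every time a map is drawn.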
We can now add the listings (points) to the map - each one has a latitude and longitude coordinate. To begin with we will just show the location of the points, using the “size” option to adjust the point size.
P + geom_point(data=listings, aes(x=longitude, y=latitude),size=2)
You will see this produces a map; however, it also creates a warning about missing values - don’t worry, this is just telling you that not all the rows of data in the data frame are visible on the map. You could make this warning go away by changing the zoom level - i.e. creating a map with a greater geographic extent.
You can also adjust the color of the points using the “color” option.
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2)
Because the price per bed is a continuous variable, the points are now scaled along a color gradient from the lowest to the highest values. However, this doesn’t show you very much, as most of the values are clustered towards the bottom of the range. We can check this by plotting the values as a histogram, where each bar is a $25 bin.
# Plot a histogram
qplot(price_beds, data=listings, geom="histogram",binwidth=25)
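An alternative way to inspect such a skewed variable, sketched below, is to view the same histogram on a logarithmic x-axis, which spreads out the clustered low values (a fixed $25 binwidth no longer makes sense on a log scale, so a bin count is used instead):

```r
# The same histogram on a log10 x-axis; 30 bins is an arbitrary choice
qplot(price_beds, data = listings, geom = "histogram", bins = 30) +
  scale_x_log10()
```

This makes it easier to judge how much of the data sits below any given price.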
There are a number of ways in which we can adjust our map to make it more effective at communicating changes in price. First we will change the color of the scale to one of the ColorBrewer palettes - for this we use the scale_color_gradientn() function.
# Load RColorBrewer
library(RColorBrewer)
#Make plot
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"))
Although the color has changed, we still have the issue of values being clustered at one end of the scale. However, there are a number of additional options that we can use to control for this. The first is “limits”, which we can use to adjust the minimum and maximum values on the scale. Here we take the range $75-300.
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limit=c(75,300))
You may have noticed some grey points on the map - these are the properties with values outside the specified range. We can hide these using a further option, “na.value”, to which you can assign either a color or, as shown in this example, NA, which hides them.
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds),size=2) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limit=c(100,500),na.value=NA)
We could, for example, use this technique to plot just the very expensive properties, which we will define as between $400-1000.
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=price_beds)) + scale_color_gradientn(colours=brewer.pal(9,"YlOrRd"),limit=c(400,1000),na.value=NA)
We can also use the “size” aesthetic to change the size of the points. For example, we might want to color the points by the bed type, but scale the points by the price.
First of all we will just map the bed type - note that the variable it is attached to is a factor, so ggmap (like ggplot) displays this as a categorical value.
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=bed_type))
We can see that most of the AirBnB listings concern real beds, although there are other types across Manhattan. We can extend this plot to explore how these relate to price. Again, we will focus on the more expensive properties between $400 - $1000. For this we add the “size” parameter to the aes() and, additionally, use a new function, scale_size(), which controls the range of point sizes used (in this case 3 to 10). You will see that there are two very expensive couches that can be rented!
P + geom_point(data=listings, aes(x=longitude, y=latitude, colour=bed_type,size=price_beds)) + scale_size(range = c(3, 10),limit=c(400,1000))
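These options can also be combined on a single map. A closing sketch, assuming the P and listings objects created above, draws together the ColorBrewer gradient, the size scale and tidier legend titles (the labels passed to labs() are choices, not requirements):

```r
# Color and size both mapped to price per bed, restricted to $400-1000,
# with matching legend titles so the two legends merge into one
P + geom_point(data = listings,
               aes(x = longitude, y = latitude,
                   colour = price_beds, size = price_beds)) +
  scale_color_gradientn(colours = brewer.pal(9, "YlOrRd"),
                        limit = c(400, 1000), na.value = NA) +
  scale_size(range = c(3, 10), limit = c(400, 1000)) +
  labs(colour = "Price per bed ($)", size = "Price per bed ($)")
```

Giving the colour and size scales the same limits and title lets ggplot combine them into a single legend.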